Skip to content

trace: add barebones ptrace setup #4401

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 18, 2025
Merged

Conversation

nia-e
Copy link
Contributor

@nia-e nia-e commented Jun 15, 2025

Per the review in #4326, this implements the absolute bare minimum ptrace bits; no access is actually intercepted or logged, but when running in native-lib mode Miri will now fork itself and trace the child process unless this is disabled via CLI flag. Currently the docs are left the same as in the original PR i.e. they might reference things that aren't implemented yet, but I can change that.

Copy link
Contributor

@oli-obk oli-obk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partial review, lunchtime, will review more later

src/bin/miri.rs Outdated
#[cfg(target_os = "linux")]
if !miri_config.native_lib.is_empty() && !miri_config.force_old_native_lib {
// FIXME: This should display a diagnostic / warning on error
// SAFETY: No other threads have spawned yet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a footgun. If we register the ctlrc handler before this then suddenly it's UB? Not sure how to improve this, maybe there's a way to ask the process how many threads it has or sth?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, to be fair we can play a little fast and loose here. The docs for fork() claim that after forking in a multithreaded context the child process must only do async-signal-safe things, but that's in a "it might be safe to do other things but it depends on what the other threads are doing" way, not in an "it's always UB to call async-signal-unsafe code" way. As long as we know what the ctrlc handler / other threads are doing, this is still sound to do in a multithreaded context. I'll check the code but given we're talking about an interrupt handler, I find it extremely unlikely that it's an issue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, leaving some more info on this safety comment about what you just told me works for me

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK fork suspends/deletes all threads except for the one doing the fork. So why does it matter what those other threads do?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with fork in a concurrent program is that those other threads may hold locks, which will now never be released. That's why async-signal-safety matters -- it's a lot like a signal handler, where the corresponding main thread may hold locks that will never get released while the signal handler runs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RalfJung should we move this module and IsolatedAlloc into a separate crate within this repo that we depend on via a path dependency? Should speed up edit cycles by a small amount and generally improve separation (and thus allow unit testing nicely and potentially moving out of tree at some point). It worked for ui_test 😆

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping @RalfJung

thoughts?

Copy link
Contributor

@oli-obk oli-obk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea this lgtm. Really cool that it's possible

@nia-e nia-e force-pushed the barebones-ptrace branch from 17ff063 to a390068 Compare June 16, 2025 14:39
@nia-e
Copy link
Contributor Author

nia-e commented Jun 16, 2025

Think this all addresses the review, thanks!

Copy link
Contributor

@oli-obk oli-obk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please rebase and squash

@RalfJung
Copy link
Member

RalfJung commented Jun 18, 2025 via email

Apply suggestions from code review

Co-authored-by: Oli Scherer <github35764891676564198441@oli-obk.de>

review comments

fix possible hang
@nia-e nia-e force-pushed the barebones-ptrace branch from 5171739 to e815550 Compare June 18, 2025 09:29
@nia-e
Copy link
Contributor Author

nia-e commented Jun 18, 2025

In transit currently so I can't build Miri locally to test this but if CI passes this should be fine, ty!

Out of sheer curiosity regarding the code which caused the merge conflict, is there any semantic difference between (*t_list).deref() and just (**t_list)? Can't say I've seen the former pattern before

@oli-obk
Copy link
Contributor

oli-obk commented Jun 18, 2025

Oh that was fallout from the metasized work. This may just be stuff from early drafts we forgot to fix later or it's something about inference. Unsure. But it should almost always be equivalent

@oli-obk oli-obk enabled auto-merge June 18, 2025 09:43
@oli-obk oli-obk added this pull request to the merge queue Jun 18, 2025
Merged via the queue into rust-lang:master with commit a87c442 Jun 18, 2025
8 checks passed
Copy link
Member

@RalfJung RalfJung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still have a bunch of questions and comments. Some of them are resolved in #4418, but a bunch are clarification questions, so I'd appreciate a follow-up PR extending the comments here and possibly doing some refactoring.

/// Returns a vector of page addresses managed by the allocator.
pub fn pages(&self) -> Vec<usize> {
let mut pages: Vec<_> =
self.page_ptrs.clone().into_iter().map(|p| p.expose_provenance()).collect();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allocates the buffer for these pages twice... that seems rather unnecessary.

@@ -227,10 +227,11 @@ impl rustc_driver::Callbacks for MiriCompilerCalls {
} else {
let return_code = miri::eval_entry(tcx, entry_def_id, entry_type, &config, None)
.unwrap_or_else(|| {
#[cfg(target_os = "linux")]
miri::register_retcode_sv(rustc_driver::EXIT_FAILURE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this done here but not all the other places where we exit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to have something to do with the way abort_if_errors() exits not properly getting us a return code. I'm unsure of what the root cause is - I spent a lot of time trying to debug that but it seems to be the only thing that causes this behaviour, so I special-cased it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

abort_if_errors doesn't exit, it unwinds. catch_with_exit_code then turns that into an exit.

I think you should be able to remove the random register_retcode here, and then once #4406 lands add this inside the new fn exit there.

Or does ptrace halt execution on an unwind? It should just resume execution then to let the program run its normal course.

// thread in an async-signal-unsafe way such as by accessing shared
// semaphores, etc.; the handler only calls `sleep()` and `exit()`, which
// are async-signal-safe, as is accessing atomics
let _ = unsafe { miri::init_sv() };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we just ignore errors here...?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it errored and the sv process failed to init, then we catch that when calling poll() on it. I would like this to emit a diagnostic on error, though; it just didn't seem necessary to include that as part of this PR

Copy link
Member

@RalfJung RalfJung Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you saying we dynamically fall back to calling native code without fine-grained tracing if the setup fails?

That makes sense, but having a let _ then does not. Also, there should be a comment.

emit a diagnostic on error, though; it just didn't seem necessary to include that as part of this PR

If you have confusing things like let _ instead you need to explain those, or at least add FIXME(ptrace): emit a warning here or so.

EDIT: There actually is a FIXME, fair. I think I'd have rather seen this staged a bit differently but that's getting into very subjective territory. ;)

/// receiving back events through `get_events`.
///
/// # Safety
/// The invariants for `fork()` must be upheld by the caller.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this safety comment. It should at least provide an easily clickable references to those fork invariants.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add that, sure thing.

// handler), they will not interact with anything on the main rustc/Miri
// thread in an async-signal-unsafe way such as by accessing shared
// semaphores, etc.; the handler only calls `sleep()` and `exit()`, which
// are async-signal-safe, as is accessing atomics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is sounds like this relying on the ctrlc internal implementation details that could change any time...?

I also have no clue why we are talking about threads here, but that may be because init_sv doesn't have a self-contained safety comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it somewhat hard to imagine how the ctrlc handler could realistically be changed to act in ways that may break the safety invariants for signal safety, but for clarity I'll add a better safety comment for init_sv.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main point is that I don't think it even matters what the ctrlc handler does. I think you misunderstood the safety requirements of fork.

// reason to use a different one
let mut curr_pid = init_pid;

// There's an initial sigstop we need to deal with
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does that come from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

init_sv() raises a sigstop on the child arm, to wait in place and make sure that the parent process can ptrace it properly. If it can't, then the child is killed instead of being resumed from sigstop

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we need documentation somewhere for the overarching protocol of who raises which signal when and what happens then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put that in trace/messages.rs but I can expand on it if it's insufficient as-is

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code here runs before the main event loop, the comment in messages.rs explains the body of the loop IIUC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll put it in the docs for init_sv then since that's the most obvious place I think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's part of the protocol, I think it makes sense to have the entire protocol in one place somewhere. So I'd add it to messages.rs. The text there could also make it more clear that those steps there are being repeated arbitrarily often, and which message types are used for which of these messages.

confirm_tx.send(Confirmation).unwrap();
// We can't trust simply calling `Pid::this()` in the child process to give the right
// PID for us, so we get it this way
curr_pid = wait_for_signal(None, signal::SIGSTOP, false).unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who's ending this SIGSTOP and where? Seems like there's a protocol here that I haven't seen documented so far.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the sigstop that the child raises in start_ffi(). Then calling ptrace::syscall() is what ends it (it's the same as ptrace::cont() but also pauses the child when it enters or exits a syscall; right now unused, but will be important for the malloc tracing later)

Comment on lines +209 to +210
// Child entered a syscall; we wait for exits inside of this, so it
// should never trigger on return from a syscall we care about
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comments. Who's waiting for what where and why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bah, this is wrong. I should have edited this when I gutted the malloc tracing bits, apologies

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am very surprised that this passed CI since we should be running this on macOS as well. Any idea what is happening here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cfg(linux)'ed all of the trace-related bits since macOS has a very barebones (bad) ptrace implementation, so when running on it it'll just use the old behaviour

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, so macos doesn't get the fine-grained ptrace. But macos apparently still prints the error that should only be printed when using fine-grained ptrace. So there's clearly a bug here, and your reply doesn't explain it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shims/trace is a bit ambiguous IMO, let's move this inside a native_lib module.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants